Document databases may be ill-formed, 
containing redundant and poorly organized 
documents. For example, a database of 
customers’ descriptions of problems with 
products and the vendor’s descriptions of 
their resolution may contain many 
descriptions of the same problem. A highly 
desirable goal is to transform the database 
into a concise set of summarized reports— 
model cases—which in turn are more 
amenable to search and problem resolution 
without expert intervention. In this paper, we 
describe techniques for attempting to 
automate the procedures for reducing a 
database to its essential components. Our 
initial application is self help for resolution of 
product problems. A lightweight document 
clustering method is described that operates 
in high dimensionality, processing tens of 
thousands of documents and grouping them 
into several thousand clusters. Techniques 
are described for summarization and 
exemplar selection to further refine the 
database contents. The method has been 
evaluated on a database of over 100000 
customer-service problem reports that are 
reduced to 3000 clusters and 5000 exemplar 
documents. Preliminary results are promising 
and demonstrate efficient clustering 
performance with excellent group similarity 
measures, reducing the original database size 
by several orders of magnitude. 


An ill-formed document repository may contain documents covering the same
topics and documents composed of unfocused text. Ideally, we would like to
reduce the size of this database by eliminating the redundant documents and
summarizing the remaining documents. To remove the redundant documents, we
consider the use of high-dimensional clustering techniques. To cleanse the
remaining documents, we consider both knowledge extraction techniques and
document summarization methods. Automated procedures cannot be expected to
perform these tasks perfectly. However, we can find real-world circumstances
where imperfect results will still provide large benefits.

We introduce these concepts by way of a help-desk example, where users submit
problems or queries online to the vendor of a product. Each submission can be
considered a document. By clustering the documents, the vendor can obtain an
overview of the types of problems that the customers are having; for example,
a computer vendor might discover that printer problems comprise a large
percentage of customer complaints. Typically, the number of clusters or
categories is no more than a few hundred and often fewer than 100.

Not all users of a product report unique problems to the help desk. It can be
expected that most problem reports are repeat problems, with many users
experiencing the same difficulty. If enough users report the same problem, a
model-case report can be created. To reduce the number of documents in the
database of problem reports, redundancies in the documents must be detected.
Unlike the summary of problem types, many problems will be similar but still
have distinctions that are critical. Thus, while the number of clusters needed
to eliminate duplication of problem reports can be expected to be much smaller
than the total number of problem reports, the number of clusters is
necessarily relatively large, much larger than the 100 clusters needed for
summarization of problem types.

Ideally, the clusters will contain documents that address the same problem and
present a solution. Many customers report the same software problem, and they
receive the same fix. Looking at the individual reports within a cluster, we
may see some variability in their quality. Some reports may be concise, almost
directly decomposing into a problem statement and solution. Others, such as
those in IBM’s call centers for software problems, are almost complete
transcripts of customer and service representative discussions. In these
cases, individual documents may include much text that does not relate to the
ultimate problem resolution. The central purpose of the database is simply to
maintain records of a customer’s interaction with a service representative.
Our ultimate goal is to summarize an individual document or to extract the
relevant sections from multiple documents, yielding a model-case report for
the problem. If that goal is achieved, the potential for self-help by
customers is greatly increased.

In this paper, we describe a machine-learning approach to automatic generation
of model-case reports. Redundant documents are detected by high-dimensional
clustering. Summaries of clusters can be found either by (1) topic
summarization of multiple documents [1] or by (2) exemplar selection and
excerpt extraction. Results show that the process can greatly reduce the size
of a database while maintaining much of its integrity. The process is not
perfect, but need not be to demonstrate efficacy.

Document clustering techniques 

The classical k-means technique [2] can be applied to document clustering. Its
weaknesses are well known. The number of clusters, k, must be specified prior
to application. The summary statistic is the mean of the values for each
cluster. The individual members of a cluster can have a high variance, and the
mean may not be a good summary of the cluster members. As the number of
clusters grows, for example to thousands of clusters, classical k-means
clustering becomes untenable [3].

More recent attention has been given to hierarchical agglomerative methods
[4]. The documents are recursively merged bottom-up, yielding a decision tree
of recursively partitioned clusters. The distance measures used to find
similarity vary from single-link to more computationally expensive ones, but
they are closely tied to nearest-neighbor distance. The algorithm works by
recursively merging the single best pair of documents or clusters, making the
computational costs prohibitive for document collections numbering in the tens
of thousands.

To cluster very large numbers of documents, possibly with a large number of
clusters, some compromises must be made to reduce the number of indexed words
and the number of expected comparisons. In Larsen and Aone [5], indexing of
each document is reduced to the 25 highest-scoring TF-IDF (term
frequency-inverse document frequency [6]) terms, and then k-means is applied
recursively, for k = 9. While efficient, this approach has the classical
weaknesses associated with k-means document clustering. A hierarchical
technique that also works in steps with a small, fixed number of clusters is
described in Cutting [7].

We describe a lightweight procedure that operates efficiently in high
dimensions and is effective in directly producing clusters that have objective
similarity. Unlike k-means clustering, the number of clusters is dynamically
determined, and similarity is based on nearest-neighbor distance, not mean
feature distance. Thus, the document clustering method maintains the key
advantage of hierarchical clustering techniques, their compatibility with
information retrieval methods, yet performance does not rapidly degrade for
large numbers of both documents and clusters.

Our goal is to describe techniques that reduce complexity and redundancy in a
database of ill-structured documents. For many applications, our clustering
method is not demonstrably optimal or necessarily superior to other clustering
techniques. However, our technique operates effectively in an application
requiring thousands of clusters, which cannot be said for many strong,
alternative clustering methods, such as the agglomerative clustering methods
[8]. Most importantly though, clustering is just a means to an end: the
refinement of an ill-structured database that will be accessed by users who
encounter problems similar to those previously encountered by other users.

Methods and procedures 

Clustering algorithms process documents in a transformed state, where the
documents are represented as a collection of terms or words. A vector
representation is used: in the simplest format, each element of the vector is
the presence or absence of a word. The same vector format is used for each
document; the vector space is defined over the complete set of words in all
documents. Clearly, a single document has a sparse vector over the set of all
words. Some processing may take place to stem words to their essential root
and to transform the presence or absence of a word to a score, such as TF-IDF,
that is a predictive distance measure. In addition, weakly predictive words,
stopwords [9], are removed. These same processes can be used to reduce
indexing further by measuring, for a document’s vector, only the m top-scoring
words in a document and setting all remaining vector entries to zero.
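
The reduced-indexing idea can be illustrated with a minimal sketch. It assumes whitespace tokenization, a tiny hypothetical stopword list, binary word presence, and a score of 1 + 1/df per word (so words found in fewer documents score higher); stemming is omitted. The function name `top_m_words` and the sample documents are illustrative only and do not come from the paper.

```python
from collections import Counter

STOPWORDS = {"the", "a", "is", "to", "of", "and", "on", "not"}  # hypothetical list

def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords; return distinct words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def top_m_words(docs, m):
    """Keep only the m best-scoring words of each document (binary presence)."""
    word_sets = [tokenize(d) for d in docs]
    df = Counter(w for ws in word_sets for w in ws)  # document frequency
    reduced = []
    for ws in word_sets:
        # Rarer words score higher; ties are broken alphabetically.
        ranked = sorted(ws, key=lambda w: (-(1 + 1 / df[w]), w))
        reduced.append(set(ranked[:m]))
    return reduced

docs = [
    "the printer is not printing",
    "printer driver install failed",
    "cannot install the driver on the server",
]
reduced = top_m_words(docs, m=2)
```

All words outside each returned set correspond to vector entries that would be set to zero.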

An alternative approach to selecting a subset of features for a document
assumes that documents are carefully composed and have effective titles [10].
Title words are always indexed, along with the m most frequent words in the
document and any human-assigned key words.

Not all words are of the same predictive value, and many approaches have been
tried for selecting a subset of words that are most predictive. The main
concept is to reduce the number of overall words that are considered, which
reduces the representational and computational tasks of the clustering
algorithm. Reduced indexing can be effective in these goals when performed
prior to clustering. The clustering algorithm accepts as input the transformed
data, as in any information retrieval system, and works with a vector
representation that is a transformation of the original documents.


Table 1  Definitions for top-k scoring algorithm

doclist:  The words (terms) in each document. A series of numbers; documents
          are separated by zeros. Example: sequence = 10 44 98 0 24 ... The
          first document has words 10, 44, and 98. The second document has
          words 24 ...

wordlist: The documents in which a word is found; a series of consecutive
          numbers pointing to specific document numbers.

word(c):  A pointer to wordlist indicating the starting location of the
          documents for word c. To process all documents for word c, access
          word(c) through word(c + 1) - 1. Example: word(1) = 1, word(2) = 4;
          wordlist = 18 22 64 16 ... Word 1 appears in the documents listed in
          locations 1, 2, and 3 in wordlist. The documents are 18, 22, and 64.

pv(c):    Predictive value of word c = 1 + idf, where idf is
          1/(number of documents where word c appears).
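
As an illustration of how the structures in Table 1 might be laid out as flat lists, here is a minimal sketch. It assumes word identifiers start at 1 (so that 0 can act as the document separator) and uses 0-based Python offsets rather than the 1-based locations of the table's example; the function name `build_index` is hypothetical.

```python
def build_index(docs, num_words):
    """docs: list of lists of word ids (1..num_words), one list per document.
    Returns (doclist, wordlist, word, pv) as flat Python lists."""
    # doclist: word ids of each document, each document terminated by 0.
    doclist = []
    for words in docs:
        doclist.extend(words)
        doclist.append(0)

    # Collect, for each word id, the 1-based numbers of documents containing it.
    postings = [[] for _ in range(num_words + 1)]
    for doc_id, words in enumerate(docs, start=1):
        for w in sorted(set(words)):
            postings[w].append(doc_id)

    # wordlist: all postings concatenated; word[c] is the start offset of word c,
    # so the documents of word c are wordlist[word[c]:word[c + 1]].
    wordlist, word = [], [0] * (num_words + 2)
    for c in range(1, num_words + 1):
        word[c] = len(wordlist)
        wordlist.extend(postings[c])
    word[num_words + 1] = len(wordlist)

    # pv[c] = 1 + 1/(number of documents containing word c); 0.0 for unused ids.
    pv = [0.0] * (num_words + 1)
    for c in range(1, num_words + 1):
        n = word[c + 1] - word[c]
        pv[c] = 1 + 1 / n if n else 0.0
    return doclist, wordlist, word, pv

docs = [[1, 2, 3], [2, 3], [3, 4]]
doclist, wordlist, word, pv = build_index(docs, num_words=4)
```

Because every structure is a single linear list, a scoring loop only needs index arithmetic to walk the documents of a word.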


Clustering methods. Our method uses a reduced indexing view of the original
documents, where only the m best keywords of each document are indexed. That
reduces a document’s vector size and the computation time for distance
measures for a clustering method. Our procedure for clustering is specified in
two parts: (1) compute the k most similar documents (typically the top 10) for
each document in the collection, and (2) group the documents into clusters
using these similarity scores. For the overall method to be efficient, both
procedures must be computationally efficient. Finding and scoring the k most
similar documents for each document will be specified as a mathematical
algorithm that processes fixed scalar vectors. The procedure is simple, a
repetitive series of loops that accesses a fixed portion of memory, leading to
efficient computation. The second procedure uses the scores for the k most
similar documents in clustering the document. Unlike the other algorithms
described earlier, the second clustering step does not perform a “best-match
first-out” merging. It merges documents and clusters on a “first-in first-out”
basis.

Table 1 describes the data structures needed to process the algorithms. Each
of these lists can be represented as a simple linear vector. Table 2 describes
the steps for the computation of the k most similar documents for each
document in the collection. Similarity or distance is measured by a simple
additive count of words found in both documents that are compared, plus their
inverse document frequency. This differs from the standard TF-IDF formula in
that term frequency is measured in binary terms, that is, 0 or 1 for presence
or absence. In addition, the values are not normalized; just the sum is used.
In a comparative study [10], we show that TF-IDF has slightly stronger
predictive value, but the simpler function has numerous advantages in terms of
interpretability, simple additive computation, and elimination of storage of
term frequencies. The steps in Table 2 can readily be modified to use TF-IDF
scoring.

Table 2  Steps for top-k scoring algorithm

1. Get the next document’s words (from doclist), and set all document scores
   to zero.

2. Get the next word, w, for the current document. If no words remain, store
   the k documents with the highest scores and continue with step 1.

3. For all documents having word w (from wordlist), add to their scores and
   continue with step 2.

Table 3  Actions for clustering document pairs

1. If the score for D_i and D_j is less than the minimum score, evaluate the
   next pair.

2. If D_i and D_j are already in the same cluster, evaluate the next pair.

3. If D_i is in a cluster and D_j is not, add D_j to the D_i cluster and
   evaluate the next pair.

4. Cluster merge step: If both D_i and D_j are in separate clusters:

   (a) If the action plan is “no merging,” evaluate the next pair.

   (b) If the action plan is “repeat documents,” repeat D_j in all the D_i
       clusters and evaluate the next pair.

   (c) Merge the D_i cluster with the D_j cluster and evaluate the next pair.
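
The steps of Table 2 can be sketched as follows. This is a minimal illustration, not the authors' program: it uses Python dictionaries instead of the flat vectors of Table 1, assumes that a document is not matched against itself, and breaks score ties by the lower document number. The name `top_k_matches` is hypothetical.

```python
import heapq
from collections import defaultdict

def top_k_matches(docs, k):
    """docs: list of sets of word ids (binary term presence).
    Returns, per document, up to k (score, other_doc) pairs, best first."""
    postings = defaultdict(list)            # word -> documents containing it
    for d, words in enumerate(docs):
        for w in words:
            postings[w].append(d)
    pv = {w: 1 + 1 / len(ds) for w, ds in postings.items()}  # 1 + idf

    results = []
    for d, words in enumerate(docs):        # step 1: next document, scores reset
        scores = defaultdict(float)
        for w in words:                     # step 2: next word of the document
            for other in postings[w]:       # step 3: add to each document with w
                if other != d:              # assumption: skip the self-match
                    scores[other] += pv[w]
        # No words remain: keep the k highest-scoring documents.
        best = heapq.nsmallest(k, scores.items(), key=lambda t: (-t[1], t[0]))
        results.append([(s, o) for o, s in best])
    return results

docs = [{1, 2, 3}, {2, 3}, {3, 4}, {5}]
matches = top_k_matches(docs, k=2)
```

Each shared word contributes one count plus its inverse document frequency, so no term frequencies need to be stored.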

The remaining task is to group the documents into clusters using these
similarity scores. We describe a single-pass algorithm for clustering, with at
most k*n comparisons of similarity, where n is the number of documents.

For each document D_i, the scoring algorithm produces a set of k documents,
{D_j}, where j varies from 1 to k. Given the scores of the top-k matches of
each document D_i, Table 3 describes the actions that may be taken for each
matched pair during cluster formation. Documents are examined in a pairwise
fashion, starting with the first document and its top-k matches. Matches below
a preset minimum score threshold are ignored. Clusters are formed by the
document pairs not yet in clusters. Clusters are merged when documents in the
matched pair appear in separate clusters. As we will see later, not allowing
merging yields a very large number of clusters containing highly similar
documents. The setting of the minimum score has a strong effect on the number
of clusters; a high value produces a relatively large number of clusters and a
zero value produces a relatively small number of clusters. Similarly, a high
minimum score may leave some documents unclustered, whereas a low value
clusters all documents. As an alternative to merging, it may be preferable to
repeat the same document in multiple clusters. We do not report results on
this form of duplication, typically done for smaller numbers of documents, but
the procedure provides an option for duplicating documents across clusters.
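
A minimal sketch of the single-pass, first-in first-out grouping is given below. It covers only the “merge” and “no merging” action plans; the “repeat documents” plan is left out. Two details are assumptions rather than statements from the text: a pair in which neither document is clustered starts a new cluster, and an unclustered D_i is added to the cluster of D_j in the same way as the reverse case. The name `cluster_pairs` is hypothetical.

```python
def cluster_pairs(matches, min_score, merge=True):
    """matches[i]: list of (score, j) top-k matches for document i.
    Returns a dict mapping each clustered document to its cluster id."""
    cluster_of = {}      # document -> cluster id
    members = {}         # cluster id -> list of documents
    next_id = 0
    for i, top in enumerate(matches):             # first-in first-out order
        for score, j in top:
            if score < min_score:                 # below threshold: ignore pair
                continue
            ci, cj = cluster_of.get(i), cluster_of.get(j)
            if ci is None and cj is None:         # assumed: pair forms a cluster
                cluster_of[i] = cluster_of[j] = next_id
                members[next_id] = [i, j]
                next_id += 1
            elif ci == cj:                        # already in the same cluster
                continue
            elif cj is None:                      # add D_j to the D_i cluster
                cluster_of[j] = ci
                members[ci].append(j)
            elif ci is None:                      # assumed symmetric case
                cluster_of[i] = cj
                members[cj].append(i)
            elif merge:                           # merge the two clusters
                for d in members.pop(cj):
                    cluster_of[d] = ci
                    members[ci].append(d)
            # otherwise the action plan is "no merging": leave both clusters
    return cluster_of

matches = [[(3.0, 1)], [(3.0, 0)], [(3.0, 3), (2.0, 1)], [(3.0, 2)], [(0.5, 0)]]
merged = cluster_pairs(matches, min_score=1.0, merge=True)
unmerged = cluster_pairs(matches, min_score=1.0, merge=False)
```

In this toy input the pair (2, 1) links two existing clusters, so merging yields one cluster while the no-merging plan keeps two; document 4 stays unclustered because its only match falls below the minimum score.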

Measures for evaluation of clustering results. How can we objectively evaluate
clustering performance? Very often, the objective measure is related to the
clustering technique. For example, k-means clustering can measure overall
distance from the mean. Techniques that are based on nearest-neighbor distance
[11], such as most information retrieval techniques, can measure distance from
the nearest neighbor or the average distance from other cluster members.

For our clustering algorithm, most distance measurement is in terms of counts
of words present in documents. A natural measure of cluster performance is the
average number of indexed words per cluster, that is, the local dictionary
size. Analogous measures of cluster “cohesion” that count the number of common
words among documents in a cluster have been used to evaluate performance
[12]. The average is computed by weighting by the number of documents in the
cluster as in Equation 1, where N is the total number of documents, m is the
number of clusters, size_i is the number of documents in the ith cluster, and
LDict_i is the number of indexed words in the ith cluster.


$$\text{average} = \sum_{i=1}^{m} \frac{\mathit{size}_i}{N}\,\mathit{LDict}_i \qquad (1)$$
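
The size-weighted average of Equation 1, together with a random-assignment baseline over clusters of the same sizes, can be sketched as follows. The function names, the fixed random seed, and the toy clusters are assumptions made for illustration.

```python
import random

def average_dictionary_size(clusters):
    """clusters: list of clusters, each a list of documents (sets of words).
    Returns the sum over clusters of (size_i / N) * LDict_i."""
    n_docs = sum(len(c) for c in clusters)
    total = 0.0
    for cluster in clusters:
        local_dict = set().union(*cluster)   # distinct indexed words in cluster
        total += len(cluster) / n_docs * len(local_dict)
    return total

def random_baseline(clusters, seed=0):
    """Reassign the same documents at random to clusters of the same sizes."""
    docs = [d for c in clusters for d in c]
    random.Random(seed).shuffle(docs)
    shuffled, start = [], 0
    for c in clusters:
        shuffled.append(docs[start:start + len(c)])
        start += len(c)
    return average_dictionary_size(shuffled)

clusters = [
    [{"printer", "jam"}, {"printer", "jam", "tray"}],
    [{"login", "password"}, {"password", "reset"}, {"login", "reset"}],
]
avg = average_dictionary_size(clusters)
baseline = random_baseline(clusters)
```

Dividing `baseline` by `avg` gives a ratio of random to computed dictionary size; values above 1 indicate that the computed clusters share more vocabulary than chance grouping would.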

Results of clustering are compared to documents randomly assigned to clusters
of the same size. Clearly, the average dictionary size for computed clusters
should be much smaller than that for randomly assigned clusters of the same
number of documents.

424 WEISS AND APTE

IBM SYSTEMS JOURNAL, VOL 41, NO 3, 2002

Summarization and excerpt extraction 

The goal of document clustering is to separate the documents into groups of
similar documents. Clustering does not reduce the number of documents or the
size of the document repository. The individual documents remain the same. Two
basic approaches for eliminating redundancy and reducing the size of the
database are (1) selecting exemplars from each cluster, and (2) providing a
summary document for each cluster.

If the documents in a cluster are redundant (for example, each describes the
same self-help problem and solution), then selecting one document from the
cluster can be sufficient to describe the cluster. Because the clustering
procedure is imperfect and some clusters may be large, it may be safer to
select more than one exemplar document to represent the cluster, that is, the
problem-solution pair.

An alternative to the exemplar summarization technique is topic summarization
[1]. Each cluster contains documents for the same topic. Unlike single
document summarization, summarizing documents on a common topic is
sample-based and explores common patterns across many documents.

Critical section extraction. The method for document reduction just described
is completely automated. However, its success depends on the quality and
clarity of the original documents. It is best, especially for exemplar-based
summarization, to have the original documents stripped to their bare
essentials; for self-help, that means stripped to the problem-solution pairs.
One can readily envision many real-world scenarios where the documents are
poorly structured. Consider the following possibilities for our self-help
example:

• The user composes the problem statement and a customer representative writes
  a solution. Because a problem-solution model is expected, and the
  discussants are asked to compose their thoughts in writing, the resultant
  document tends toward clarity and conciseness.

• The user communicates by phone to a call center and the representative
  creates a real-time approximate transcript of the dialog. This is the actual
  situation at IBM’s call centers, where thousands of these documents are
  created by thousands of service representatives each day. Many documents are
  rambling, with extraneous text.

For the first possibility, little additional preparation is needed. For the
second possibility, additional effort is required before automated procedures
can extract the critical sections. Knowledge-based models can be very helpful.
For example, we know that the problem statement is typically at the beginning
of the document and the solution is at the end. Moreover, the service
representatives are told to prefix critical sections with key words like
“action taken.” Far more helpful, and far more powerful for a self-help
document, would be for the customer representative to write a one- or two-line
summary of the solution. This takes a little extra time, which reduces the
amount of time available for a single representative to take more calls and is
not needed when the main purpose of the document is to maintain a record of
the customer’s problem. If the expectation is for redundancy, that is,
recurring problems, then concise and consistent summarization by the authors
warrants the additional effort, because of the long-term productivity gains in
more quickly matching new problems to the stored documents in the database.

Exemplar selection. The same measure of evaluation can be used to find
exemplar documents for a cluster. The local dictionary of a document cluster
can be used as a virtual document that is matched to the members of the
cluster. The most frequently matched documents can be considered a ranked list
of exemplar documents for the cluster.

Selecting exemplar documents from a cluster is a form of cluster
summarization. The technique for selecting the exemplars is based on matching
the cluster’s dictionary of words to its constituent documents. The words
themselves can provide another mode of summary for a cluster. The highest
frequency words in the local dictionary of a cluster often can distinguish a
cluster from others. If only a few words are extracted, they may be considered
a label for the cluster.
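
One possible reading of exemplar selection and cluster labeling is sketched below. The scoring rule is an assumption, since the text does not spell it out: each word of the local dictionary is weighted by the number of cluster members containing it, and a document is scored by the sum of the weights of its words. The name `rank_exemplars` and the sample cluster are hypothetical.

```python
from collections import Counter

def rank_exemplars(cluster, n_label_words=2):
    """cluster: dict mapping document id -> set of indexed words.
    Returns (document ids ranked as exemplars, label words)."""
    # Local dictionary with in-cluster document frequencies.
    local_dict = Counter(w for words in cluster.values() for w in words)
    # Assumed scoring: a document earns, for each of its words, the number of
    # cluster members sharing that word; ties are broken by document id.
    scores = {d: sum(local_dict[w] for w in words) for d, words in cluster.items()}
    ranked = sorted(cluster, key=lambda d: (-scores[d], d))
    # The highest-frequency dictionary words serve as a short cluster label.
    by_freq = sorted(local_dict.items(), key=lambda t: (-t[1], t[0]))
    label = [w for w, _ in by_freq[:n_label_words]]
    return ranked, label

cluster = {
    101: {"printer", "jam", "tray"},
    102: {"printer", "jam"},
    103: {"printer", "toner"},
}
ranked, label = rank_exemplars(cluster)
```

Taking the first one, two, or three entries of `ranked` gives the exemplars for a cluster, with more entries kept for larger clusters.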

Results 

To evaluate the performance of the clustering algorithms, we obtained a total
of 51110 documents from customer reports for IBM AS/400* (Application
System/400*) computer systems. These documents were constructed in real time
by customer service representatives who recorded their phone dialogs with
customers encountering problems.

Table 4  Results for clustering help-desk problems

Cluster   Average   Dictionary   Unclustered   Minimum   Merge
Number    Size      Ratio        Percentage    Score
    49    1027.3    1.4           1.5          1         Yes
    86     579.6    1.4           2.5          2         Yes
   410     105.5    1.5          16.2          3         Yes
  3250      15.5    1.8           1.5          1         No
  3346      14.9    1.8           2.5          2         No
  3789      11.4    1.9          16.2          3         No


The documents were indexed with a total of 21682 words in a global dictionary
computed from all the documents. Table 4 summarizes the results for clustering
the document collection in terms of the number of clusters, the average
cluster size, the ratio of the local dictionary size under random assignment
to that of the computed clusters, the percentage of unclustered documents, the
minimum score for matching document pairs, and whether or not clusters were
merged. The first row in the table indicates that 49 clusters were found with
an average size of 1027 documents. A random cluster’s dictionary was on
average 1.4 times larger than the generated cluster’s, and 1.5 percent of the
documents were not clustered. These results were obtained by using a minimum
score of 1, and cluster merging was allowed. All results are for finding the
top-ten document matches.

A single clustering run, represented by one row in Table 4, currently takes 15
minutes on a 375 megahertz IBM RISC/6000* processor running AIX* (Advanced
Interactive Executive). The program is written in Java** code.

Exemplar documents were selected for each of the 3250 clusters found in the
fourth row of the table. For some large clusters two or three exemplars were
selected, for a total of 5000 exemplar documents. Using the same scoring
scheme, each of the exemplars was compared to the 51110 original documents. At
least one exemplar matched 98.5 percent of the documents, with at least one
indexed word in common. The matching exemplar belonged to the assigned cluster
for 80 percent of the documents.

Discussion 

Our techniques for detecting and removing redundant documents from the
repository use high-dimensional clustering. By some empirical measures, we can
demonstrate that the process is effective in achieving our stated objectives.
The process is efficient in high dimensions, both for large document
collections and for large numbers of clusters. No compromises are made to
partition the clustering process into smaller subproblems. All documents are
clustered in one stage. In the self-help application, it is important to
remove duplication, while still maintaining a large number of exemplar
documents. The help-desk clusters have strong similarity for their documents,
suggesting that they can be readily summarized by one or two documents. For
the largest number of clusters, dictionary size is nearly half that for random
document assignment, far better than for a smaller number of clusters.

Although our methods have many desirable properties for operating in high
dimensions, we have not demonstrated that the lightweight algorithm is optimal
in any sense. Moreover, we have not given any empirical comparisons to other
clustering methods that show its superiority. For our central goal of
extracting the critical segments of documents in an ill-formed repository, the
superiority of the clustering algorithm is not the only component of
evaluation. We know that the extracted documents will be accessed by a search
engine that will give multiple answers in response to a query or to a document
matcher that matches problem descriptions to stored documents. To be
effective, we need a high-dimensional clustering method that operates in
reasonable times, but we need not have perfect clustering to eliminate
redundancy. We must consider the trade-off of database completeness with
voluminous redundancy versus compactness with some missing documents.

Help-desk applications may benefit from a reduction in database size. Even
when incomplete, a more precise database can help resolve many problems
without having the user wade through dozens of repeat examples before finding
a relevant one. Resolving the most frequent problems can help reduce the
number of problems that must be resolved by customer representatives. From a
technical perspective, these practical issues cannot be solved by optimal
clustering, and a true evaluation can only be performed by real-world
field-testing.

For topic summarization [1], researchers have reported some good preliminary
results by k-means clustering of sentences or paragraphs for the pooled
documents, and selecting those sentences that are most similar to a cluster’s
mean vector. We have not yet tried this summarization technique. It remains a
promising, but more complex, approach. The exemplar approach keeps documents
intact. Summarization by topic merges excerpts of many documents, and is
therefore more susceptible to mistakes in the extraction of the excerpts.
However, the exemplar approach is more dependent on starting with cleansed
documents that contain the critical sections of the document, for example, the
problem statement and solution pair. The topic summarization technique has the
potential to find the critical sections, because they will appear in many
samples, while discarding sections that are unique to single documents.

We have presented an initial approach to transforming ill-structured documents
into a clear and unique set of filtered documents. In our application,
customer problem reports become FAQs (frequently asked questions). We have
addressed some of the major issues in this important application, and have
presented specific techniques for solving this problem. Additional research is
needed, including comparisons with other methods that might improve
performance for this application.

*Trademark or registered trademark of International Business